Using a bike instead of a car for short trips reduces the carbon footprint of that trip by about 75% [source]. Shared public systems help increase the use of bicycles for transportation. The United States Department of Transportation Bureau of Transportation Statistics (BTS) tracks docked and dockless bikeshare and e-scooter systems, and presents a summary of these micromobility systems at this site. Here is their summary video:
Let’s focus on bikeshare systems and load in data on them from BTS:
bikes_scooter <- read_csv("http://faculty.olin.edu/dshuman/DS/Bikeshare_Docked_and_Dockless_and_E-scooter_Systems_by_Year_and_City_Served.csv") %>%
  janitor::clean_names()
bikeshare <- bikes_scooter %>%
  select(city, state, year, dock_ct) %>%
  filter(dock_ct > 0) %>%
  arrange(city, state, year)
| city | state | year | dock_ct |
|---|---|---|---|
| Long Beach | CA | 2021 | 86 |
| Blacksburg | VA | 2019 | 12 |
| Columbia | MD | 2018 | 9 |
| Los Angeles | CA | 2017 | 119 |
| Harrisburg | PA | 2019 | 11 |
| Fort Lauderdale | FL | 2020 | 21 |
We also have a second table, from Wikipedia, that has the population and land area for the 250 most populous cities in the United States, with all data from the 2020 Census:
top_250_cities_by_pop <- read_csv("http://faculty.olin.edu/dshuman/DS/us_top_250_cities_by_population.csv") %>%
  janitor::clean_names() %>%
  select(city, state, pop_2020, area_sq_miles)
| city | state | pop_2020 | area_sq_miles |
|---|---|---|---|
| Lincoln | NE | 291082 | 97.7 |
| Lexington | KY | 322570 | 283.6 |
| Pasadena | TX | 151950 | 43.7 |
| Boise | ID | 235684 | 84.0 |
| Victorville | CA | 134810 | 73.7 |
| Roseville | CA | 147773 | 44.1 |
Additional reading:
A join is a data verb that combines two tables.
There are several kinds of join.
A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.
As an example, we’ll examine different ways to combine the bikeshare and top_250_cities_by_pop tables.
The very first question to ask is what variables the two tables have in common. In this case, it is the city and the state.
The first class of joins are mutating joins, which add new variables (columns) to the left data table from matching observations in the right table.1
The main difference among the three mutating join options is how they handle cases in one table that have no match in the other:

- left_join(): the output has all cases from the left table, regardless of whether there is a match in the right table, but discards any cases in the right table that do not have a match in the left table.
- inner_join(): the output has only the cases from the left table that have a match in the right table.
- full_join(): the output has all cases from both the left and the right tables. This is less common than the first two join operators.

When there are multiple matches in the right table for a particular case in the left table, all three of these mutating join operators produce a separate case in the new table for each of the matches from the right.
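These behaviors can be sketched on two toy tables (hypothetical data, for illustration only). Base R's merge() is used here so the sketch is self-contained; dplyr's left_join(), inner_join(), and full_join() correspond to merge() with all.x = TRUE, the default, and all = TRUE, respectively:

```r
# Hypothetical toy tables for illustration
left_tbl  <- data.frame(city = c("A", "B", "C"), year = c(2020, 2021, 2022))
right_tbl <- data.frame(city = c("A", "A", "B"), pop = c(10, 11, 20))

# left join: all left cases kept; unmatched left cases get NA in the new columns
lj <- merge(left_tbl, right_tbl, by = "city", all.x = TRUE)
# inner join: only left cases with a match in the right table
ij <- merge(left_tbl, right_tbl, by = "city")
# full join: all cases from both tables
fj <- merge(left_tbl, right_tbl, by = "city", all = TRUE)

# City "A" matches two right cases, so it appears twice in each output
nrow(lj)  # 4 rows: A (twice), B, and C (with NA pop)
nrow(ij)  # 3 rows: A (twice) and B
```

Note how the duplicate "A" in the right table duplicates that case in every output, exactly as described above.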
One of the most common and useful mutating joins is one that translates levels of a variable to a new scale; e.g., a join that translates letter grades (e.g., “B”) into grade points (e.g., 3) for a GPA calculation.
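As a sketch of this translation pattern (hypothetical students and a hand-made lookup table; base R's merge() stands in for left_join()):

```r
# Hypothetical grades and a lookup table from letter grade to grade points
grades <- data.frame(student = c("Ana", "Ben", "Cam"),
                     grade   = c("A", "B", "B"))
scale_tbl <- data.frame(grade  = c("A", "B", "C"),
                        points = c(4, 3, 2))

# Joining on grade translates each letter grade to its grade points
gpa_tbl <- merge(grades, scale_tbl, by = "grade", all.x = TRUE)
mean(gpa_tbl$points)  # GPA: (4 + 3 + 3) / 3 = 3.33
```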
Example 2.1 (left_join) Let’s mutate two new columns to the bikeshare table by pulling in the population and land area from the top_250_cities_by_pop table:
bikeshare2 <- bikeshare %>%
  left_join(top_250_cities_by_pop, by = c("state" = "state", "city" = "city"))
| city | state | year | dock_ct | pop_2020 | area_sq_miles |
|---|---|---|---|---|---|
| Long Beach | CA | 2021 | 86 | 466742 | 50.7 |
| Blacksburg | VA | 2019 | 12 | NA | NA |
| Columbia | MD | 2018 | 9 | NA | NA |
| Los Angeles | CA | 2017 | 119 | 3898747 | 469.5 |
| Harrisburg | PA | 2019 | 11 | NA | NA |
| Fort Lauderdale | FL | 2020 | 21 | 182760 | 34.6 |
A few notes:
We need to match both city and state, to distinguish for example between Rochester, NY and Rochester, MN.
In this case, the variable names we are matching happen to be the same in both tables, but they don’t need to be. If the state variable in the right table were instead named ST, we’d just change that part of the match to "state"="ST".
If the by= is omitted from a join, then R will perform a natural join, which matches the two tables by all variables they have in common. In this case, that would yield the same result (but you should be careful in general):
bikeshare2 <- bikeshare %>%
  left_join(top_250_cities_by_pop)
## Joining with `by = join_by(city, state)`
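The danger with a natural join is that it matches on every shared column name, whether or not that was intended. A small sketch (hypothetical data; base R's merge() defaults to a natural join too):

```r
# Both toy tables have a "year" column, but the years mean different
# things (survey year vs. census year)
surveys <- data.frame(city = c("A", "B"), year = c(2021, 2021), riders = c(5, 7))
census  <- data.frame(city = c("A", "B"), year = c(2020, 2020), pop = c(10, 20))

# The natural join matches on both city AND year, so nothing matches
nrow(merge(surveys, census))              # 0 rows
# Matching explicitly on city alone gives the intended result
nrow(merge(surveys, census, by = "city")) # 2 rows
```

This is why it is good practice to spell out by= even when the natural join happens to be correct.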
Example 2.2 (inner_join) The bikeshare2 table has many rows corresponding to cities that aren’t among the 250 most populous, and thus have NAs in the last two columns. If we want to create a table that only has information for cities that appear in both tables, we can instead use inner_join:
bikeshare3 <- bikeshare %>%
  inner_join(top_250_cities_by_pop, by = c("state" = "state", "city" = "city"))
| city | state | year | dock_ct | pop_2020 | area_sq_miles |
|---|---|---|---|---|---|
| Salt Lake City | UT | 2018 | 34 | 199723 | 110.3 |
| Chicago | IL | 2019 | 608 | 2746388 | 227.7 |
| Honolulu | HI | 2018 | 136 | 350964 | 60.5 |
| San Antonio | TX | 2021 | 55 | 1434625 | 498.8 |
| New York | NY | 2021 | 1566 | 8804190 | 300.5 |
| Milwaukee | WI | 2017 | 86 | 577222 | 96.2 |
The second class of joins are filtering joins, which select specific cases from the left table based on whether they match an observation in the right table.
- semi_join(): discards any cases in the left table that do not have a match in the right table. If there are multiple matches of right cases to a left case, it keeps just one copy of the left case.
- anti_join(): discards any cases in the left table that do have a match in the right table.

A particularly common use of these joins is to compare against a filtered summary in order to select a subset of the original cases, as follows.
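In base R terms, a semi join is just row subsetting with %in% (a sketch on hypothetical data; dplyr's semi_join() and anti_join() handle multi-column keys for you):

```r
left_tbl  <- data.frame(city = c("A", "B", "C"), dock_ct = c(1, 2, 3))
right_tbl <- data.frame(city = c("A", "A", "B"))  # "A" appears twice

# semi join: keep left cases with at least one match in the right table
semi <- left_tbl[left_tbl$city %in% right_tbl$city, ]
# anti join: keep left cases with no match in the right table
anti <- left_tbl[!left_tbl$city %in% right_tbl$city, ]

semi$city  # "A" "B" -- each kept once, even though "A" matched twice
anti$city  # "C"
```

Unlike the mutating joins above, no new columns are added and no rows are duplicated.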
Example 2.3 (semi_join to compare to a filtered summary) Compute a subset of the bikeshare table that only contains rows from the three most populous cities in the United States.
Solution. This is where we can use semi_join to filter the rows of bikeshare:
top_three_populus <- top_250_cities_by_pop %>%
  arrange(desc(pop_2020)) %>%
  head(n = 3)
bikeshare_in_populus_cities <- bikeshare %>%
  semi_join(top_three_populus)
| city | state | year | dock_ct |
|---|---|---|---|
| Chicago | IL | 2015 | 469 |
| Chicago | IL | 2016 | 582 |
| Chicago | IL | 2017 | 574 |
| Chicago | IL | 2018 | 608 |
| Chicago | IL | 2019 | 608 |
| Chicago | IL | 2020 | 612 |
| Chicago | IL | 2021 | 671 |
| Chicago | IL | 2022 | 1367 |
| Chicago | IL | 2023 | 1636 |
| Los Angeles | CA | 2016 | 104 |
| Los Angeles | CA | 2017 | 119 |
| Los Angeles | CA | 2018 | 127 |
| Los Angeles | CA | 2019 | 127 |
| Los Angeles | CA | 2020 | 213 |
| Los Angeles | CA | 2021 | 218 |
| Los Angeles | CA | 2022 | 219 |
| Los Angeles | CA | 2023 | 223 |
| New York | NY | 2015 | 507 |
| New York | NY | 2016 | 664 |
| New York | NY | 2017 | 811 |
| New York | NY | 2018 | 841 |
| New York | NY | 2019 | 841 |
| New York | NY | 2020 | 1011 |
| New York | NY | 2021 | 1566 |
| New York | NY | 2022 | 1703 |
| New York | NY | 2023 | 1954 |
Example 2.4 (anti_join) Which U.S. cities among the 250 most populous have no bikeshare program listed in the BTS data?
Solution. While semi_join keeps rows of the left table that have a match in the right table, anti_join keeps rows that do not have a match:
populus_cities_no_bike_scooter <- top_250_cities_by_pop %>%
  anti_join(bikeshare) %>%
  arrange(desc(pop_2020))
| city | state | pop_2020 | area_sq_miles |
|---|---|---|---|
| San Jose | CA | 1013240 | 178.3 |
| Jacksonville | FL | 949611 | 747.3 |
| Fresno | CA | 542107 | 115.2 |
| Sacramento | CA | 524943 | 98.6 |
| Mesa | AZ | 504258 | 138.7 |
| Colorado Springs | CO | 478961 | 195.4 |
Here is an additional table from Wikipedia with the top 150 United States cities by land area:
top_150_cities_by_area <- read_csv("http://faculty.olin.edu/dshuman/DS/us_top_150_cities_by_land_area.csv") %>%
  janitor::clean_names() %>%
  arrange(desc(land_area_mi2))
| city | st | land_area_mi2 | land_area_km2 | water_area_mi2 | water_area_km2 | total_area_mi2 | total_area_km2 | population_2020 |
|---|---|---|---|---|---|---|---|---|
| Sitka | AK | 2870.1 | 7434 | 1945.1 | 5038.0 | 4815.1 | 12471 | 8458 |
| Juneau | AK | 2704.2 | 7004 | 550.7 | 1426.0 | 3254.9 | 8430 | 32255 |
| Wrangell | AK | 2556.1 | 6620 | 920.6 | 2384.0 | 3476.7 | 9005 | 2127 |
| Anchorage | AK | 1707.0 | 4421 | 239.7 | 621.0 | 1946.7 | 5042 | 291247 |
| Tribune | KS | 778.2 | 2016 | 0.0 | 0.0 | 778.2 | 2016 | 1182 |
| Jacksonville | FL | 747.3 | 1935 | 127.2 | 329.0 | 874.5 | 2265 | 949611 |
| Anaconda | MT | 736.7 | 1908 | 4.7 | 12.0 | 741.4 | 1920 | 9421 |
| Butte | MT | 715.8 | 1854 | 0.6 | 1.6 | 716.3 | 1855 | 34494 |
| Houston | TX | 640.6 | 1659 | 31.2 | 81.0 | 671.8 | 1740 | 2304580 |
| Oklahoma City | OK | 606.5 | 1571 | 14.3 | 37.0 | 620.8 | 1608 | 681054 |
Exercise 2.1 Use your wrangling skills to answer the following questions. Hint: start by thinking about what tables you might need to join (if any), determining which is the left table and which is the right table, and identifying the corresponding variables to match.
Which of the 250 most populous cities in the United States had the most shared bicycle docking stations per square mile in 2022?
Which of the 20 most populous cities in the United States has the most land area per 100,000 inhabitants?
Find a list of all of the 20 largest U.S. cities by land area that had a bikeshare program listed in the BTS data at any time between 2015 and 2023.
How many U.S. cities are in the top 150 by population, but not in the top 150 by land area? Hint: It is an even number!
bikeshare3 %>%
mutate(stations_per_sq_mile = dock_ct/area_sq_miles) %>%
filter(year==2022) %>%
arrange(desc(stations_per_sq_mile))
## # A tibble: 49 × 7
## city state year dock_ct pop_2020 area_sq_miles stations_per_sq_mile
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Washington DC 2022 684 689545 61.1 11.2
## 2 San Francisco CA 2022 494 873965 46.9 10.5
## 3 Boston MA 2022 434 675647 48.3 8.99
## 4 Oakland CA 2022 494 440646 55.9 8.84
## 5 Minneapolis MN 2022 399 429954 54 7.39
## 6 Chicago IL 2022 1367 2746388 228. 6.00
## 7 New York NY 2022 1703 8804190 300. 5.67
## 8 Miami FL 2022 160 442241 36 4.44
## 9 Buffalo NY 2022 110 278349 40.4 2.72
## 10 Honolulu HI 2022 134 350964 60.5 2.21
## # ℹ 39 more rows
head(top_250_cities_by_pop, n=20L) %>%
mutate(area_per_100k = area_sq_miles/(pop_2020/100000)) %>%
arrange(desc(area_per_100k))
## # A tibble: 20 × 5
## city state pop_2020 area_sq_miles area_per_100k
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Jacksonville FL 949611 747. 78.7
## 2 Indianapolis IN 887642 362. 40.7
## 3 Fort Worth TX 918915 347. 37.8
## 4 Charlotte NC 874579 308. 35.3
## 5 San Antonio TX 1434625 499. 34.8
## 6 Austin TX 961855 320. 33.3
## 7 Phoenix AZ 1608139 518 32.2
## 8 Houston TX 2304580 640. 27.8
## 9 Dallas TX 1304379 340. 26.0
## 10 Columbus OH 905748 220 24.3
## 11 San Diego CA 1386932 326. 23.5
## 12 Denver CO 715522 153. 21.4
## 13 San Jose CA 1013240 178. 17.6
## 14 Los Angeles CA 3898747 470. 12.0
## 15 Seattle WA 737015 83.8 11.4
## 16 Washington DC 689545 61.1 8.86
## 17 Philadelphia PA 1603797 134. 8.38
## 18 Chicago IL 2746388 228. 8.29
## 19 San Francisco CA 873965 46.9 5.37
## 20 New York NY 8804190 300. 3.41
top_150_cities_by_area %>%
  semi_join(bikeshare, by = c("city" = "city", "st" = "state")) %>%  # match both city and state
  arrange(desc(land_area_mi2)) %>%
  head(n = 20L)
## # A tibble: 20 × 9
## city st land_area_mi2 land_area_km2 water_area_mi2 water_area_km2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Houston TX 641. 1659 31.2 81
## 2 Oklahoma City OK 606. 1571 14.3 37
## 3 Phoenix AZ 518. 1342 1 2.6
## 4 San Antonio TX 499. 1292 5.8 15
## 5 Nashville TN 476. 1232 21.6 56
## 6 Los Angeles CA 470. 1219 31.1 81
## 7 Indianapolis IN 361 935 6.9 18
## 8 Fort Worth TX 350. 907 8.3 21
## 9 Dallas TX 340. 880 44 114
## 10 Austin TX 326. 845 6.6 17
## 11 San Diego CA 326. 845 46.3 120
## 12 Louisville KY 324. 840 16.5 43
## 13 Kansas City MO 315. 815 4.1 11
## 14 Charlotte NC 311. 805 2 5.2
## 15 Memphis TN 296. 767 7.6 20
## 16 El Paso TX 259. 670 0.6 1.6
## 17 Tucson AZ 242. 627 0.3 0.78
## 18 Chicago IL 228. 590 6.8 18
## 19 Huntsville AL 224. 579 1.5 3.9
## 20 Columbus OH 221. 572 5.9 15
## # ℹ 3 more variables: total_area_mi2 <dbl>, total_area_km2 <dbl>,
## # population_2020 <dbl>
top_250_cities_by_pop %>%
arrange(desc(pop_2020)) %>%
head(n=150L) %>%
anti_join(top_150_cities_by_area, by = c("city"="city", "state"="st")) %>%
filter(city!="New York")
## # A tibble: 63 × 4
## city state pop_2020 area_sq_miles
## <chr> <chr> <dbl> <dbl>
## 1 San Francisco CA 873965 46.9
## 2 Washington DC 689545 61.1
## 3 Boston MA 675647 48.3
## 4 Long Beach CA 466742 50.7
## 5 Miami FL 442241 36
## 6 Oakland CA 440646 55.9
## 7 Minneapolis MN 429954 54
## 8 Cleveland OH 372624 77.7
## 9 Honolulu HI 350964 60.5
## 10 Anaheim CA 346824 50.3
## # ℹ 53 more rows
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Figure 3.1: A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.
Figure 3.2: One of the vans used to redistribute bicycles to different stations.
Two data tables are available:
- Trips contains records of individual rentals
- Stations gives the locations of the bike rental stations

Here is the code to read in the data:2
Trips <- read_csv("http://faculty.olin.edu/dshuman/DS/2014-Q4-Trips-History-Data-Small.csv")
#Trips <- read_csv("http://faculty.olin.edu/dshuman/DS/2014-Q4-Trips-History-Data.csv")
Stations<-read_csv("http://faculty.olin.edu/dshuman/DS/DC-Stations.csv")
The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the file name.
It’s natural to expect that bikes are rented more at some times of day, on some days of the week, and in some months of the year than others. The variable sdate gives the time (including the date) that the rental started.
Exercise 3.1 (Single variable temporal plots) Make the following plots and interpret them:
- Make a density plot of the rental start times sdate. Use ggplot() and geom_density().
- Use mutate with lubridate::hour() and lubridate::minute() to extract the hour of the day and minute within the hour from sdate, and make a density plot of this time of day. Hint: A minute is 1/60 of an hour, so create a field where 3:30 is 3.5 and 3:45 is 3.75.
- Make a bar chart of the day of the week on which each rental starts.
- Repeat the time-of-day density plot, faceted by day of the week.

ggplot(Trips, aes(sdate)) +
geom_density()
Trips <- Trips %>%
mutate(day_time = lubridate::hour(sdate) + lubridate::minute(sdate)/60)
ggplot(Trips, aes(day_time)) +
geom_density()
Trips <- Trips %>%
mutate(weekday = lubridate::wday(sdate, label=TRUE))
ggplot(Trips, aes(weekday)) +
geom_bar()
ggplot(Trips, aes(day_time)) +
geom_density()+
facet_wrap(~weekday)
::: {.solution}
On weekdays, there are spikes in rentals during the morning and evening commuting times.
:::
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (level Casual). Do you think these two different categories of users show different rental behavior? How might client interact with the patterns you found in Exercise 3.1?
Exercise 3.2 (Customer segmentation) Repeat the graphic from Exercise 3.1 (d) with the following changes:

- Set the fill aesthetic for geom_density() to the client variable. You may also want to set the alpha for transparency and color=NA to suppress the outline of the density function.
- Add position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
- Add mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday")). What does the variable wday represent? Try to understand the code.
- Should you facet by wday and fill with client, or vice versa?

ggplot(Trips) +
  geom_density(aes(day_time, fill = client),
               alpha = 0.5, color = NA) +
  facet_wrap(~weekday)
ggplot(Trips) +
  geom_density(aes(day_time, fill = client),
               alpha = 0.5, color = NA,
               position = position_stack()) +
  facet_wrap(~weekday)
::: {.solution}
The stacked version is able to show the total number of users at a given time of day, but the unstacked version is better at showing the contrast between casual and registered users. I think the non-stacked graph is better at showing that registered users are the ones using the bikes during commuting times.
:::
Trips <- Trips %>%
mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))
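The test %in% c(1,7) works because lubridate::wday() numbers days from 1 (Sunday) to 7 (Saturday) by default. A base-R sketch of the same weekend flag (note that POSIXlt's wday field instead runs from 0 for Sunday to 6 for Saturday):

```r
# Three example dates in the data's quarter: a Saturday, a Sunday, and a Monday
dates <- as.Date(c("2014-10-04", "2014-10-05", "2014-10-06"))
wd <- as.POSIXlt(dates)$wday  # 6, 0, 1 (Sat, Sun, Mon)

# In base R's 0-6 numbering, the weekend test becomes %in% c(0, 6)
ifelse(wd %in% c(0, 6), "weekend", "weekday")
# "weekend" "weekend" "weekday"
```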
ggplot(Trips) +
  geom_density(aes(day_time, fill = client),
               alpha = 0.5, color = NA) +
  facet_wrap(~wday)
ggplot(Trips) +
  geom_density(aes(day_time, fill = wday),
               alpha = 0.5, color = NA) +
  facet_wrap(~client)
Solution.

- wday represents whether the rental started on a weekday or on a weekend.
- Facet by wday and fill with client, because it makes more sense to have weekday/weekend as the overall header, since the x-axis is time of day.

Exercise 3.3 (Visualization of bicycle departures by station) Use the latitude and longitude variables in Stations to make a spatial visualization of the total number of departures from each station in the Trips data. Your map can be static or interactive via leaflet. Here is a good bounding box:
dc_bbox<-c(-77.1,38.87,-76.975,38.95)
register_stadiamaps("YOUR-STADIA-MAPS-API-KEY", write = TRUE)  # replace with your own API key
# Count departures per station, then attach each station's coordinates
station_departures <- Trips %>%
  count(sstation) %>%
  left_join(Stations, by = c("sstation" = "name")) %>%
  select(name = sstation, n, lat, long)
ggmap(get_stadiamap(bbox = dc_bbox, maptype = "stamen_toner_lite", zoom = 13)) +
  geom_point(data = station_departures, aes(x = long, y = lat, size = n), alpha = 0.5) +
  labs(title = "Rental Bike Departures in Washington DC", size = "Departures")
Exercise 3.4 Only 14.4% of the trips in our data are carried out by casual users.3 Create a map that shows which area(s) of the city have stations with a much higher percentage of departures by casual users. Interpret your map. Hint: you may want to exclude stations with low overall traffic (e.g., under 20 total departures).
casual_departures <- Trips %>%
  count(sstation, client) %>%
  pivot_wider(names_from = client, values_from = n, values_fill = 0) %>%
  filter(Casual + Registered >= 20) %>%  # exclude low-traffic stations (see hint)
  mutate(casual_percentage = Casual / (Casual + Registered)) %>%
  left_join(Stations, by = c("sstation" = "name")) %>%
  select(name = sstation, casual_percentage, lat, long)
ggmap(get_stadiamap(bbox = dc_bbox, maptype = "stamen_toner_lite", zoom = 13)) +
  geom_point(data = casual_departures, aes(x = long, y = lat, color = casual_percentage),
             alpha = 0.5, size = 3) +
  labs(title = "Casual Rider Departures in Washington DC", color = "Percentage of Casual Riders")
::: {.solution}
There are more casual riders in the southern part of the city, by the river. This is likely the more touristy part of the city: downtown, by the water, etc.
:::
Exercise 3.5 (High traffic points)
- Find the ten combinations of station and date with the highest numbers of trip departures. Hint: as_date(sdate) converts sdate from date-time format to date format.
- Use a join to restrict Trips to only the trips that started at one of those ten station-date combinations.
- Group those trips by client and by wday (weekend/weekday), and count the total number of trips in each of the four groups. Interpret your results.

station_date <- Stations %>%
left_join(Trips, by=c("name"="sstation")) %>%
mutate(date = as_date(sdate)) %>%
group_by(name, date) %>%
count() %>%
arrange(desc(n)) %>%
head(10)
station_date
## # A tibble: 10 × 3
## # Groups: name, date [10]
## name date n
## <chr> <date> <int>
## 1 Columbus Circle / Union Station 2014-11-12 11
## 2 Jefferson Dr & 14th St SW 2014-12-27 9
## 3 Lincoln Memorial 2014-10-05 9
## 4 Lincoln Memorial 2014-10-09 8
## 5 17th St & Massachusetts Ave NW 2014-10-06 7
## 6 Columbus Circle / Union Station 2014-10-02 7
## 7 Georgetown Harbor / 30th St NW 2014-10-25 7
## 8 Massachusetts Ave & Dupont Circle NW 2014-10-01 7
## 9 New Hampshire Ave & T St NW 2014-10-16 7
## 10 14th & V St NW 2014-11-07 6
busy_day_trips <- Trips %>%
mutate(date = as_date(sdate)) %>%
semi_join(station_date, by=c("sstation"="name", "date"="date"))
busy_day_trips %>%
arrange(sstation) %>%
select(sstation, client, weekday)
## # A tibble: 78 × 3
## sstation client weekday
## <chr> <chr> <ord>
## 1 14th & V St NW Registered Fri
## 2 14th & V St NW Registered Fri
## 3 14th & V St NW Registered Fri
## 4 14th & V St NW Registered Fri
## 5 14th & V St NW Registered Fri
## 6 14th & V St NW Registered Fri
## 7 17th St & Massachusetts Ave NW Registered Mon
## 8 17th St & Massachusetts Ave NW Registered Mon
## 9 17th St & Massachusetts Ave NW Registered Mon
## 10 17th St & Massachusetts Ave NW Registered Mon
## # ℹ 68 more rows
busy_day_trips %>%
group_by(wday, client) %>%
count()
## # A tibble: 4 × 3
## # Groups: wday, client [4]
## wday client n
## <chr> <chr> <int>
## 1 weekday Casual 9
## 2 weekday Registered 44
## 3 weekend Casual 20
## 4 weekend Registered 5
Solution. The busiest stations on a given day mainly served registered riders on weekdays and casual riders on weekends. I suspect the large number of registered riders is due to a public transit shutdown, causing bus and subway commuters to bike instead. The higher number of casual riders on weekends is likely a large group of tourists all renting bikes from the same place together.
There is also a right_join() that adds variables in the reverse direction from the left table to the right table, but we do not really need it as we can always switch the roles of the two tables.↩︎
Important: To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.↩︎
We can compute this statistic via mean(Trips$client=="Casual").↩︎